Map-Reduce Parallelization of Motif Discovery
Abstract
Motif discovery is one of the most challenging problems in bioinformatics today, and DNA sequence motifs are becoming increasingly important in the analysis of gene regulation. Motifs are short, recurring patterns in DNA that have a biological function; for example, they indicate binding sites for transcription factors (TFs) and nucleases. A number of motif discovery algorithms run sequentially, and their sequential nature prevents them from being parallelized. HOMER is one such motif discovery tool. To overcome this limitation, we propose a new methodology for motif discovery that uses HOMER and parallelizes the task. A parallelized version can potentially yield better scalability and performance. To achieve this, we use sub-sampling and the Map-Reduce model. At each map node, a sub-sampled version of the input DNA sequences is used as input to HOMER. Sub-sampling at each map node is performed with different parameters to ensure that no two HOMER instances receive identical inputs. The output of the map phase, which is also the input of the reduce phase, is a list of motifs discovered from the sub-sampled sequences. The reduce phase calculates the mode, that is, the most frequent motifs, and outputs them as the final discovered motifs. We found marginal speed gains with this model of execution and a substantial loss of quality in the discovered motifs.
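The map and reduce phases described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and it does not call HOMER: `toy_motif_finder` is a hypothetical stand-in that simply returns the most frequent k-mers, and the names `map_task`, `reduce_task`, `discover_motifs`, `fraction`, and `n_mappers` are assumptions made for illustration. Each mapper here varies only its random seed, which makes identical sub-samples unlikely but does not strictly guarantee distinct inputs.

```python
import random
from collections import Counter


def toy_motif_finder(sequences, k=4, top_n=3):
    """Hypothetical stand-in for HOMER: return the top_n most frequent k-mers."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [kmer for kmer, _ in ranked[:top_n]]


def map_task(sequences, fraction, seed, k=4, top_n=3):
    """Map phase: sub-sample the input, then run the motif finder on the sample."""
    rng = random.Random(seed)  # a different seed per mapper gives a different sample
    sample_size = max(1, int(len(sequences) * fraction))
    subsample = rng.sample(sequences, sample_size)
    return toy_motif_finder(subsample, k=k, top_n=top_n)


def reduce_task(motif_lists, top_n=3):
    """Reduce phase: keep the motifs reported by the most mappers (the mode)."""
    votes = Counter()
    for motifs in motif_lists:
        votes.update(set(motifs))  # each mapper votes at most once per motif
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [motif for motif, _ in ranked[:top_n]]


def discover_motifs(sequences, n_mappers=4, fraction=0.5, k=4, top_n=3):
    """Run the mappers one after another here; a real cluster would run them in parallel."""
    mapped = [map_task(sequences, fraction, seed, k, top_n) for seed in range(n_mappers)]
    return reduce_task(mapped, top_n)


if __name__ == "__main__":
    example = ["ACGTACGTACGT"] * 8  # made-up input, not real data
    print(discover_motifs(example, top_n=1))
```

Because each mapper sees only part of the input, a motif that is rare in the full data set may be missed by most mappers and lose the vote, which is consistent with the quality loss the abstract reports.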
Similar resources
Development of an Efficient Hybrid Method for Motif Discovery in DNA Sequences
This work presents a hybrid method for motif discovery in DNA sequences. The proposed method, called SPSO-Lk, borrows the concept of Chebyshev polynomials and uses stochastic local search to improve the performance of the basic PSO algorithm as a motif finder. The Chebyshev polynomial concept encourages us to use a linear combination of previously discovered velocities beyond that proposed b...
Parallelization of genetic algorithms using Hadoop Map/Reduce
In this paper we present a parallel implementation of a genetic algorithm using the map/reduce programming paradigm. The Hadoop implementation of the map/reduce library is used for this purpose. We compare our implementation with the implementation presented in [1]. The two implementations are compared on the One Max (bit counting) problem. The comparison criteria between implementations are fitness converge...
Parallelization of Rich Models for Steganalysis of Digital Images using a CUDA-based Approach
There are several different methods for building an efficient strategy for steganalysis of digital images. A very powerful method in this area is the rich model, consisting of a large number of diverse sub-models in both the spatial and transform domains that should be utilized. However, the extraction of the various types of features from an image is very time consuming in some steps, especially for training pha...
Model selection for sequence patterns
Abstract: In this article we propose a maximal a posteriori (MAP) criterion for model selection in the motif discovery problem and investigate conditions under which the MAP asymptotically gives a correct prediction of model size. We also investigate robustness of the MAP to prior specification and provide guidelines for choosing prior hyper-parameters for motif models based on sensitivity cons...
Parallel Network Motif Finding
Network motifs are over-represented patterns within a network, and signify the fundamental building blocks of that network. The process of finding network motifs is closely related to the traditional subgraph isomorphism problem in computer science, which finds instances of a particular subgraph in a graph. This problem has been proven NP-complete, and thus even for relatively small subgraphs a...
Journal title:
- CoRR
Volume: abs/1405.0354
Publication date: 2014